colexecerror: allow-list vecindex packages for panic catching#170164
Conversation
|
😎 Merged successfully - details. |
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
| colPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/col" | ||
| encodingPackagePrefix = "github.com/cockroachdb/cockroach/pkg/util/encoding" | ||
| execinfraPackagePrefix = "github.com/cockroachdb/cockroach/pkg/sql/execinfra" | ||
| sqlColPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/col" | ||
| sqlRowPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/row" | ||
| sqlSemPackagesPrefix = "github.com/cockroachdb/cockroach/pkg/sql/sem" | ||
| sqlVecindexPackagePrefix = "github.com/cockroachdb/cockroach/pkg/sql/vecindex" |
There was a problem hiding this comment.
Note: Reformatted by crlfmt
|
What happen when we don't add vecindex in allow list? (which is our current state) By allow-listing vecindex, are we bypassing genuine panics? When
When it make sense to crash the server? Where do the vector processors manipulate shared state or hold locks?
Why
|
mw5h
left a comment
There was a problem hiding this comment.
This PR converts a node crash from vecindex panics into a SQL error — that's a real win for users, and the PR-comment breakdown of locking surfaces in vecindex is the right level of due diligence for an allow-list change.
I have one blocking concern, however: I believe the analysis of Manager.Get is incorrect, and as written this PR converts a node crash into a silent per-index hang on the index-construction path.
Manager.Get is reachable inside CatchVectorizedRuntimeError
The PR comment claims:
Manager.Get() is called from newVectorSearchProcessor() / newVectorMutationSearchProcessor() during processor construction, not during Next() execution. CatchVectorizedRuntimeError only wraps the Next() call path, so panics in Manager.Get() never reach the allow-list check.
Under the default vectorize=on, this is not the case. Verified trace:
pkg/sql/colflow/vectorized_flow.go:1137wraps the body ofsetupFlowincolexecerror.CatchVectorizedRuntimeError.- Line 1199 calls
colbuilder.NewColOperatorwithProcessorConstructor: rowexec.NewProcessor(set at line 1186). colbuilder/execplan.go::supportedNativelyhas no case forcore.VectorSearch, so it returnserrCoreUnsupportedNatively.NewColOperator(line 836) falls into thecreateAndWrapRowSourcebranch.canWrappermitsVectorSearchwrapping (line 345, empty case body).wrapRowSourcesline 108 (toWrap, err := newToWrap(toWrapInputs)) invokes the closure that callsrowexec.NewProcessoreagerly at flow-setup time, not lazily insideNext().rowexec.NewProcessor→newVectorSearchProcessor(pkg/sql/rowexec/vector_search.go:65) →getVectorIndexForSearch→Manager.Get.
Steps 1–7 all run on the same goroutine inside the same CatchVectorizedRuntimeError invocation. Under vectorize=off, the row-based flow doesn't wrap setup, so the risk is specific to the default vectorized path.
Cache-corruption consequence
In pkg/sql/vecindex/manager.go::Get (lines 92–173), if a panic occurs inside the inner closure (e.g. from vecstore.New, cspann.NewIndex, quantize.NewRaBitQuantizer, or any runtime panic on that path):
- The defer chain releases
m.mucleanly. ✓ - But lines 162–172 are skipped, so:
e.mustWaitstaystruee.waitCond.Broadcast()never fires- The cache entry is never cleared
Subsequent Get calls for the same (tableID, indexID) reach e.waitCond.Wait() (line 111) and block forever. The previously loud node crash becomes a silent per-index hang requiring a process restart to recover.
Suggested resolutions (pick one)
-
(recommended) Make
Manager.Getpanic-safe in this PR. Add a deferred cleanup that, on abnormal exit, setse.mustWait = false, populatese.errwith a sentinel error, callse.waitCond.Broadcast(), and removes the cache entry. Sketch:e = &indexEntry{mustWait: true, waitCond: sync.Cond{L: &m.mu}} m.mu.indexes[idxKey] = e var idx *cspann.Index var err error defer func() { if e.mustWait { // Abnormal exit (panic). Unblock waiters and clear the entry // so the next Get retries. e.mustWait = false if err == nil { err = errors.AssertionFailedf("vector index construction panicked") } e.err = err e.waitCond.Broadcast() delete(m.mu.indexes, idxKey) } }()
-
Narrow the allow-list to specific subpackages (e.g.
pkg/sql/vecindex/cspann,pkg/sql/vecindex/vecstore,pkg/sql/vecindex/quantize) so thatManager.Getpanics still crash.
Other observations
- No regression test. The 4-statement repro from #146693 / #146694 would make a clean logictest. The existing dimension-mismatch coverage in
pkg/sql/logictest/testdata/logic_test/vector_indexis on a non-vector-index path; the panic case from the linked issue is specifically theVECTOR INDEX+LIMITshape. - Tangential nit:
pkg/sql/vecindex/manager.go:170writesm.mu.indexes[idxKey] = nilon the error path. This leaves a deadnil*indexEntryin the map (since Go's map index doesn't delete onnilassignment). It happens to work because the subsequente := m.mu.indexes[idxKey]; if e != niltest treats it as missing, but adelete(m.mu.indexes, idxKey)would be cleaner. Pre-existing; not introduced by this PR. - Naming nit posted inline.
(made with /review-crdb)
mw5h
left a comment
There was a problem hiding this comment.
@mw5h reviewed 1 file and all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, KAmbekar, and ZhouXing19).
14da968 to
952fb25
Compare
KAmbekar
left a comment
There was a problem hiding this comment.
No regression test. The 4-statement repro from #146693 / #146694 would make a clean logictest. The existing dimension-mismatch coverage in
pkg/sql/logictest/testdata/logic_test/vector_indexis on a non-vector-index path; the panic case from the linked issue is specifically theVECTOR INDEX+LIMITshape.
To test the allow-list, I added TestVectorIndexPanicCaught. It uses a testing knob to force a panic inside vecindex, then checks the panic is caught instead of crashing the node.
Note: Your check from #146848 already catches this SQL before it reaches vecindex code. The existing test at vector_index:519 covers it. Adding the same SQL again would just test your check, not the allow-list.
Tangential nit:
pkg/sql/vecindex/manager.go:170writesm.mu.indexes[idxKey] = nilon the error path. This leaves a deadnil*indexEntryin the map (since Go's map index doesn't delete onnilassignment). It happens to work because the subsequente := m.mu.indexes[idxKey]; if e != niltest treats it as missing, but adelete(m.mu.indexes, idxKey)would be cleaner. Pre-existing; not introduced by this PR.
Added.
Naming nit posted inline.
Updated.
@KAmbekar made 1 comment.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, mw5h, and ZhouXing19).
⚪ Sysbench [SQL, 3node, oltp_read_write]
Reproducebenchdiff binaries: mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests🔴 Sysbench [KV, 3node, oltp_read_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_write_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/952fb25/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/bin/pkg_sql_tests benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/952fb25/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/0c5b33a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/bin/pkg_sql_tests benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/0c5b33a/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=0c5b33a --new=952fb25 --memprofile ./pkg/sql/testsArtifactsdownload: mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/952fb2538c38f1b48be8c8f0d3d87e3c6deed56e/26394125893-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/0c5b33ae3e54ec2f2f9f0acfae9d3a5ef1c00ddb/26394125893-1/\* old/built with commit: 952fb2538c38f1b48be8c8f0d3d87e3c6deed56e |
952fb25 to
22a88f9
Compare
⚪ Sysbench [SQL, 3node, oltp_read_write]
Reproducebenchdiff binaries: mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_read_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_write_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/22a88f9/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/22a88f9154d8649d625f9d51e1ad87793e3e01ab/bin/pkg_sql_tests benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/22a88f9/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/fc8cddc/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/bin/pkg_sql_tests benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fc8cddc/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=fc8cddc --new=22a88f9 --memprofile ./pkg/sql/testsArtifactsdownload: mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/22a88f9154d8649d625f9d51e1ad87793e3e01ab/26408125903-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/fc8cddcb212a3204c098e1964bd61aeaf5570ff5/26408125903-1/\* old/built with commit: 22a88f9154d8649d625f9d51e1ad87793e3e01ab |
22a88f9 to
bcddff6
Compare
⚪ Sysbench [SQL, 3node, oltp_read_write]
Reproducebenchdiff binaries: mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_read_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_write_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/bcddff6/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/bin/pkg_sql_tests benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/bcddff6/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/3e97b8a/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/3e97b8ab41072413abcea3f807fabc09efaee330/bin/pkg_sql_tests benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/3e97b8a/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=3e97b8a --new=bcddff6 --memprofile ./pkg/sql/testsArtifactsdownload: mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0/26413783108-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/3e97b8ab41072413abcea3f807fabc09efaee330/26413783108-1/\* old/built with commit: bcddff61a21c6ffd1b6373cd0d5e50e0cf1511e0 |
mw5h
left a comment
There was a problem hiding this comment.
@mw5h reviewed 11 files and all commit messages, and made 2 comments.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on cthumuluru-crdb, DrewKimball, KAmbekar, and ZhouXing19).
pkg/sql/vecindex/manager.go line 149 at r4 (raw file):
e.err = errors.NewAssertionErrorWithWrappedErrf( cause, "vector index construction panicked") case nil:
nit: this is dead code because panic(nil) becomes *runtime.PanicNilError. This whole section could probably be something like:
r := recover()
cause, ok := r.(error)
if !ok {
// Covers both nil (Goexit-style exit) and non-error panic values.
cause = errors.Newf("non-error panic value: %v", r)
}
e.err = errors.NewAssertionErrorWithWrappedErrf(cause, "vector index construction panicked")
Previously, panics originating from the vecindex packages were not in the colexecerror allow-list, causing them to crash the server instead of being returned as SQL errors. For example, a dimension mismatch query against a vector index would bring down the node. Catching these panics is safe for two reasons. The vecindex search path holds no in-memory locks (it relies on KV-layer locking via vecstore), so a panic mid-operation cannot tear shared state. And Manager.Get — the one in-memory cache on the path — now cleans up after a panic during cache-miss construction: waiters are woken with an error and the failed entry is removed so the next call retries. Caught panics are returned as assertion errors with sentry reports, preserving bug visibility without crashing the server. Resolves: cockroachdb#146694 Epic: none Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
bcddff6 to
fd8a66e
Compare
⚪ Sysbench [SQL, 3node, oltp_read_write]
Reproducebenchdiff binaries: mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/SQL/3node/oltp_read_write$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_read_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_read_only$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/tests⚪ Sysbench [KV, 3node, oltp_write_only]
Reproducebenchdiff binaries: mkdir -p benchdiff/fd8a66e/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/fd8a66e81046db8c6a3faebfce672294b8b8f392/bin/pkg_sql_tests benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/fd8a66e/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
mkdir -p benchdiff/a081a00/bin/1058449141
gcloud storage cp gs://cockroach-microbench-ci/builds/a081a00a50d84e8da201bbe55b898ec41b9b013b/bin/pkg_sql_tests benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_tests
chmod +x benchdiff/a081a00/bin/1058449141/cockroachdb_cockroach_pkg_sql_testsbenchdiff command: # NB: for best (most stable) results, also add a suitable `--benchtime` that
# results in ~1s to ~5s of benchmark runs. For example, if ops average ~3ms, a
# benchtime of `1000x` is appropriate.
#
# Some benchmarks (in particular BenchmarkSysbench) output additional memory
# profiles covering only the execution (excluding the setup/teardown) - those
# should be preferred for analysis since they more closely correspond to what's
# reported as B/op and alloc/op.
benchdiff --run=^BenchmarkSysbench/KV/3node/oltp_write_only$ --old=a081a00 --new=fd8a66e --memprofile ./pkg/sql/testsArtifactsdownload: mkdir -p new
gcloud storage cp gs://cockroach-microbench-ci/artifacts/fd8a66e81046db8c6a3faebfce672294b8b8f392/26495148089-1/\* new/
mkdir -p old
gcloud storage cp gs://cockroach-microbench-ci/artifacts/a081a00a50d84e8da201bbe55b898ec41b9b013b/26495148089-1/\* old/built with commit: fd8a66e81046db8c6a3faebfce672294b8b8f392 |
Previously, panics from pkg/sql/vecindex were not in the colexecerror
allow-list, so a bug on the vector-index foreground path (e.g. a
dimension mismatch in a search) crashed the node instead of returning
a SQL error to the client. Add the package prefix to the allow-list.
Also fixes a latent bug in Manager.Get exposed once panics from this
code path are caught instead of fatal: a panic during index construction
previously left parked waiters deadlocked forever, because the normal
mustWait=false / Broadcast path was skipped. Manager.Get now installs a
cleanup defer that wakes waiters with a wrapped error, removes the
half-built cache entry so subsequent callers retry, and re-panics so
the calling goroutine's colexecerror catch still fires.
The allow-list does not cover panics on cspann.FixupProcessor background
goroutines, which run under stopper.RunAsyncTask and continue to crash
the node via stopper.recover() by design. The safety argument is for
the foreground SQL execution path only.
Caught panics surface as assertion errors with sentry reports,
preserving bug visibility without crashing the server.
Resolves: #146694
Epic: none
Release note (bug fix): Vector-index queries that previously crashed
the node on internal errors (e.g. dimension mismatch) now return a SQL
error to the client instead.